Details regarding the data

-My data is deposited in GEO as GSE211993

-I have downloaded my raw sequences from SRA (40 FASTQ pair end can be found using ID PRJNA873113)

-My publication is The transcriptomic landscape of neurons carrying PSEN1 mutations reveals changes in extracellular matrix components and non-coding gene expression

-The data were generated by researchers involved in the studies cited, including Tubsuwan et al. (2016), Li et al. (2016), Pires et al. (2016), Poon et al. (2016), Shi et al. (2012), and Haukedal et al. (2022), among others involved in the generation, differentiation, and characterization of hiPSCs and the subsequent RNA sequencing and analysis.

-RNA data was extracted using the RNeasy® Plus Mini Kit (Qiagen, 74,134) according to the manufacturer’s protocol.

-Bulk Poly(A) and total RNA libraries were constructed for 5 replicates according to BGI protocols and sequenced with BGIseq (DNBseq, 2 × 100 nt, stranded paired-end).

-Cells of interest were human induced pluripotent stem cells (hiPSCs) were differentiated into glutamatergic forebrain neurons.

-The experimental condition involved the differentiation of hiPSCs into glutamatergic forebrain neurons using a modified version of a protocol exploiting dual SMAD inhibition with LDN-193189 as an analogue for Noggin, targeting mutations PSEN1 L150P and PSEN1 A79V through CRISPR-Cas9 gene editing.

-The sequencing was conducted with BGIseq (DNBseq, 2 × 100 nt, stranded paired-end).

Prerequisite - Conda environment

Create rna_seq_analysis.yaml to create conda environment with all necessary tools:

name: rna_seq_analysis
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - sra-tools=3.1.0
  - fastqc=0.12.1
  - multiqc=1.21
  - cutadapt=4.7
  - bbmap=39.06
  - star=2.7.11b
  - samtools=1.19.2
  - subread=2.0.6
  - wget=1.21.4

Run the following command to create new environment with specified name:

conda env create -f rna_seq_analysis.yaml

This environment contains all you need for the further analysis, just activate it via:

conda activate rna_seq_analysis

Prerequisite - Interactive shell/slurm

You can either run following scripts in interactive shell on Weill Cornell Cayuga server using:

srun -n1 --pty --partition=angsd_class --mem=40G bash -i

or you can edit scripts to run them using batch scheduler of your choice.

Prerequisite - Script location

All scripts are located at /athena/angsd/scratch/mij4011/homework/project

Prerequisite - Master Script

Master Script can be used to run all the tasks in a series to avoid unnecessary repetitiveness and improve reproducibility.

master.sh:

#!/bin/bash

# Stop on error
set -e

# Activate environment
conda activate rna_seq_analysis

# Step 1: Download Data
bash download_data.sh

# Step 2: Quality Control
bash fastqc.sh

# Step 3: Adapter Trimming
bash cutadapt.sh

# Step 4: rRNA Removal
bash silva_database_download.sh
bash run_bbduk.sh

# Quality control after processing
bash processed_multiqc.sh

# Step 5: Alignment
bash hg38.sh
bash index.sh
bash star.sh

# Quality check for mapping
bash mapping_quality.sh

# Step 6: Feature Counts
bash featurecounts.sh

echo "Analysis Complete"

1. Writing a script to download fastq files.

download_data.sh:

# ! / bin / bash -i

# Define an array of accession numbers from downloaded metadata file from SRA
ACCESSIONS=(
SRR21190736 SRR21190737 SRR21190738 SRR21190739 SRR21190740
SRR21190741 SRR21190742 SRR21190743 SRR21190744 SRR21190745
SRR21190746 SRR21190747 SRR21190748 SRR21190749 SRR21190750
SRR21190751 SRR21190752 SRR21190753 SRR21190754 SRR21190755
SRR21190756 SRR21190757 SRR21190758 SRR21190759 SRR21190760
SRR21190761 SRR21190762 SRR21190763 SRR21190764 SRR21190765
SRR21190766 SRR21190767 SRR21190768 SRR21190769 SRR21190770
SRR21190771 SRR21190772 SRR21190773 SRR21190774 SRR21190775
)

# Create a directory named "fastq" and move into it
mkdir -p fastq
cd fastq

# Loop through the accession numbers and download each one
for ACCESSION in "${ACCESSIONS[@]}"; do
    echo "Downloading $ACCESSION..."
    fasterq-dump "$ACCESSION" --split-files --skip-technical --threads 4
done

echo "Download completed."

# Move back to the original directory
cd ..

2. Quality control with FastQC and MultiQC

fastqc.sh:

#!/bin/bash

FASTQ_DIR="fastq"

# Output directory for FastQC results
OUTPUT_DIR="fastqc_results"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR" multiqc

# Run FastQC on all FASTQ files in the directory
fastqc -o "$OUTPUT_DIR" -t 40 "$FASTQ_DIR"/*.fastq

# Run MultiQC to combine the FastQC reports
multiqc "$OUTPUT_DIR" -o multiqc

3. Trimming data

According to the protocol, the next step in the analysis is trimming the adapters.

The sequencing was performed using BGIseq (DNBseq) technology and libraries were constructed following BGI protocols, the adapters you need to trim will likely be specific to BGI’s sequencing platforms. I have found out that adapter for forward strand is AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA and for backward is AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG. I am going to test it using the following script. adapter_test.sh:

#!/bin/bash

# Adapter sequences
FORWARD_ADAPTER="AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"
REVERSE_ADAPTER="AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"
FASTQ_DIR="path/to/fastq_files"

# Loop through all FASTQ files in the specified directory
for FASTQ_FILE in $FASTQ_DIR/*.fastq; do
    # Determine if the file is forward or reverse based on its name
    if [[ $FASTQ_FILE =~ _1.fastq$ ]]; then
        # Search for the forward adapter
        ADAPTER_SEQ=$FORWARD_ADAPTER
        COUNT=$(grep -o "$ADAPTER_SEQ" "$FASTQ_FILE" | wc -l)
        echo "Found $COUNT instances of the forward adapter sequence in $FASTQ_FILE"
    elif [[ $FASTQ_FILE =~ _2.fastq$ ]]; then
        # Search for the reverse adapter
        ADAPTER_SEQ=$REVERSE_ADAPTER
        COUNT=$(grep -o "$ADAPTER_SEQ" "$FASTQ_FILE" | wc -l)
        echo "Found $COUNT instances of the reverse adapter sequence in $FASTQ_FILE"
    else
        echo "Skipping $FASTQ_FILE: not recognized as forward or reverse."
    fi
done

output of testing:


Found 6058 instances of the forward adapter sequence in fastq/SRR21190736_1.fastq
Found 928 instances of the reverse adapter sequence in fastq/SRR21190736_2.fastq
Found 5027 instances of the forward adapter sequence in fastq/SRR21190737_1.fastq
Found 752 instances of the reverse adapter sequence in fastq/SRR21190737_2.fastq
Found 5326 instances of the forward adapter sequence in fastq/SRR21190738_1.fastq
Found 881 instances of the reverse adapter sequence in fastq/SRR21190738_2.fastq
Found 3339 instances of the forward adapter sequence in fastq/SRR21190739_1.fastq
Found 600 instances of the reverse adapter sequence in fastq/SRR21190739_2.fastq
Found 5090 instances of the forward adapter sequence in fastq/SRR21190740_1.fastq
Found 876 instances of the reverse adapter sequence in fastq/SRR21190740_2.fastq
Found 6257 instances of the forward adapter sequence in fastq/SRR21190741_1.fastq
Found 1025 instances of the reverse adapter sequence in fastq/SRR21190741_2.fastq
Found 3346 instances of the forward adapter sequence in fastq/SRR21190742_1.fastq
Found 553 instances of the reverse adapter sequence in fastq/SRR21190742_2.fastq
Found 4611 instances of the forward adapter sequence in fastq/SRR21190743_1.fastq
Found 663 instances of the reverse adapter sequence in fastq/SRR21190743_2.fastq
Found 4848 instances of the forward adapter sequence in fastq/SRR21190744_1.fastq
Found 858 instances of the reverse adapter sequence in fastq/SRR21190744_2.fastq
Found 4369 instances of the forward adapter sequence in fastq/SRR21190745_1.fastq
Found 1021 instances of the reverse adapter sequence in fastq/SRR21190745_2.fastq
Found 7493 instances of the forward adapter sequence in fastq/SRR21190746_1.fastq
Found 1434 instances of the reverse adapter sequence in fastq/SRR21190746_2.fastq
Found 5330 instances of the forward adapter sequence in fastq/SRR21190747_1.fastq
Found 982 instances of the reverse adapter sequence in fastq/SRR21190747_2.fastq
Found 6193 instances of the forward adapter sequence in fastq/SRR21190748_1.fastq
Found 1116 instances of the reverse adapter sequence in fastq/SRR21190748_2.fastq
Found 5028 instances of the forward adapter sequence in fastq/SRR21190749_1.fastq
Found 905 instances of the reverse adapter sequence in fastq/SRR21190749_2.fastq
Found 5482 instances of the forward adapter sequence in fastq/SRR21190750_1.fastq
Found 869 instances of the reverse adapter sequence in fastq/SRR21190750_2.fastq
Found 4665 instances of the forward adapter sequence in fastq/SRR21190751_1.fastq
Found 689 instances of the reverse adapter sequence in fastq/SRR21190751_2.fastq
Found 4057 instances of the forward adapter sequence in fastq/SRR21190752_1.fastq
Found 770 instances of the reverse adapter sequence in fastq/SRR21190752_2.fastq
Found 5233 instances of the forward adapter sequence in fastq/SRR21190753_1.fastq
Found 909 instances of the reverse adapter sequence in fastq/SRR21190753_2.fastq
Found 3782 instances of the forward adapter sequence in fastq/SRR21190754_1.fastq
Found 749 instances of the reverse adapter sequence in fastq/SRR21190754_2.fastq
Found 3413 instances of the forward adapter sequence in fastq/SRR21190755_1.fastq
Found 633 instances of the reverse adapter sequence in fastq/SRR21190755_2.fastq
Found 2812 instances of the forward adapter sequence in fastq/SRR21190756_1.fastq
Found 358 instances of the reverse adapter sequence in fastq/SRR21190756_2.fastq
Found 1054 instances of the forward adapter sequence in fastq/SRR21190757_1.fastq
Found 113 instances of the reverse adapter sequence in fastq/SRR21190757_2.fastq
Found 2271 instances of the forward adapter sequence in fastq/SRR21190758_1.fastq
Found 291 instances of the reverse adapter sequence in fastq/SRR21190758_2.fastq
Found 2866 instances of the forward adapter sequence in fastq/SRR21190759_1.fastq
Found 365 instances of the reverse adapter sequence in fastq/SRR21190759_2.fastq
Found 2019 instances of the forward adapter sequence in fastq/SRR21190760_1.fastq
Found 227 instances of the reverse adapter sequence in fastq/SRR21190760_2.fastq
Found 1951 instances of the forward adapter sequence in fastq/SRR21190761_1.fastq
Found 187 instances of the reverse adapter sequence in fastq/SRR21190761_2.fastq
Found 1723 instances of the forward adapter sequence in fastq/SRR21190762_1.fastq
Found 207 instances of the reverse adapter sequence in fastq/SRR21190762_2.fastq
Found 2604 instances of the forward adapter sequence in fastq/SRR21190763_1.fastq
Found 333 instances of the reverse adapter sequence in fastq/SRR21190763_2.fastq
Found 4818 instances of the forward adapter sequence in fastq/SRR21190764_1.fastq
Found 794 instances of the reverse adapter sequence in fastq/SRR21190764_2.fastq
Found 6579 instances of the forward adapter sequence in fastq/SRR21190765_1.fastq
Found 1071 instances of the reverse adapter sequence in fastq/SRR21190765_2.fastq
Found 6366 instances of the forward adapter sequence in fastq/SRR21190766_1.fastq
Found 1169 instances of the reverse adapter sequence in fastq/SRR21190766_2.fastq
Found 5101 instances of the forward adapter sequence in fastq/SRR21190767_1.fastq
Found 920 instances of the reverse adapter sequence in fastq/SRR21190767_2.fastq
Found 8104 instances of the forward adapter sequence in fastq/SRR21190768_1.fastq
Found 1580 instances of the reverse adapter sequence in fastq/SRR21190768_2.fastq
Found 3275 instances of the forward adapter sequence in fastq/SRR21190769_1.fastq
Found 413 instances of the reverse adapter sequence in fastq/SRR21190769_2.fastq
Found 8945 instances of the forward adapter sequence in fastq/SRR21190770_1.fastq
Found 1371 instances of the reverse adapter sequence in fastq/SRR21190770_2.fastq
Found 4592 instances of the forward adapter sequence in fastq/SRR21190771_1.fastq
Found 697 instances of the reverse adapter sequence in fastq/SRR21190771_2.fastq
Found 4857 instances of the forward adapter sequence in fastq/SRR21190772_1.fastq
Found 882 instances of the reverse adapter sequence in fastq/SRR21190772_2.fastq
Found 4420 instances of the forward adapter sequence in fastq/SRR21190773_1.fastq
Found 715 instances of the reverse adapter sequence in fastq/SRR21190773_2.fastq
Found 4435 instances of the forward adapter sequence in fastq/SRR21190774_1.fastq
Found 637 instances of the reverse adapter sequence in fastq/SRR21190774_2.fastq
Found 3292 instances of the forward adapter sequence in fastq/SRR21190775_1.fastq
Found 565 instances of the reverse adapter sequence in fastq/SRR21190775_2.fastq

Looks like the adapter sequences I found are fine, so we can jump to trimming.

cutadapt.sh:

#!/bin/bash

# Adapter sequences
FORWARD_ADAPTER="AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"
REVERSE_ADAPTER="AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"

# Create a directory for the trimmed output if it doesn't exist
mkdir -p trimmed_fastq

# Loop through all the _1.fastq files in the fastq directory
for file in fastq/*_1.fastq; do
    # Identify the pair
    file1=$file
    file2=${file/_1.fastq/_2.fastq}

    # Generate output filenames
    out1=trimmed_fastq/$(basename "$file1" .fastq)_trimmed.fastq
    out2=trimmed_fastq/$(basename "$file2" .fastq)_trimmed.fastq

    # Run Cutadapt
    cutadapt \
        -a "$FORWARD_ADAPTER" -A "$REVERSE_ADAPTER" \
        -q 30 --pair-filter=any --minimum-length=30 \
        -o "$out1" -p "$out2" \
        "$file1" "$file2"
done

-q 30: This option trims low-quality ends from reads before adapter removal. The number 30 specifies the quality score threshold. In this case, bases with a quality score lower than 30 will be trimmed from the end of each read.

–pair-filter=any: This is used with paired-end reads. It indicates that if either the forward or reverse read of a pair needs to be discarded (for example, if after trimming it’s too short), then both reads of the pair will be discarded,thus ensuring that the output maintains properly paired reads.

–minimum-length=30: This specifies the minimum length of reads to keep after trimming. Reads that are shorter than 30 bases after trimming will be discarded. This is useful for filtering out very short sequences that may not be useful for further analysis.

4. Removal of residual rRNA annotated in SILVA with BBDuk

The next step according to the protocol from the paper is to get rid of residual rRNA in our samples. This involves:

Downloading SSU (Small Subunit) and LSU (Large Subunit) Ref NR (Non-Redundant) datasets from silva.

This can be done via silva_database_download.sh:

#!/bin/bash

# Downloading the SILVA LSU and SSU reference FASTA files
wget https://ftp.arb-silva.de/release_119.1/Exports/SILVA_119_LSURef_tax_silva.fasta.gz
wget https://ftp.arb-silva.de/release_119.1/Exports/SILVA_119.1_SSURef_Nr99_tax_silva.fasta.gz

# Decompressing the downloaded files
gunzip SILVA_119_LSURef_tax_silva.fasta.gz
gunzip SILVA_119.1_SSURef_Nr99_tax_silva.fasta.gz

And running bbduk.sh to actually remive residual rRNA:

#!/bin/bash

# Directory paths
TRIMMED_FASTQ_DIR="trimmed_fastq"
RRNA_REMOVED_DIR="rRNA_removed_fastq"
TEMP_DIR="temp_rRNA_removed"

# Reference databases (ensure these are in FASTA format)
LSU_DB="SILVA_119_LSURef_tax_silva.fasta"
SSU_DB="SILVA_119.1_SSURef_Nr99_tax_silva.fasta"

# Number of threads to use
THREADS=32

mkdir -p "$RRNA_REMOVED_DIR" "$TEMP_DIR"

# Function to run BBDuk for rRNA removal
run_bbduk() {
    local in1=$1
    local in2=$2
    local out1=$3
    local out2=$4
    local ref=$5

    bbduk.sh \
        in="$in1" \
        in2="$in2" \
        out="$out1" \
        out2="$out2" \
        ref="$ref" \
        k=31 \
        ktrim=n \
        mcf=0.5 \
        tbo \
        tpe \
        threads="$THREADS"
}

# Loop through all the trimmed FASTQ files in pairs
for file1 in "$TRIMMED_FASTQ_DIR"/*_1_trimmed.fastq; do
    file2="${file1%_1_trimmed.fastq}_2_trimmed.fastq"
    
    # Intermediate and final output filenames
    temp1="$TEMP_DIR/$(basename "$file1")"
    temp2="$TEMP_DIR/$(basename "$file2")"
    final1="$RRNA_REMOVED_DIR/$(basename "$file1" .fastq)_rRNA_removed.fastq"
    final2="$RRNA_REMOVED_DIR/$(basename "$file2" .fastq)_rRNA_removed.fastq"
    
    # Step 1: Remove LSU rRNA
    echo "Removing LSU rRNA for: $(basename "$file1") and $(basename "$file2")"
    run_bbduk "$file1" "$file2" "$temp1" "$temp2" "$LSU_DB"
    
    # Step 2: Remove SSU rRNA using the output from step 1 as input
    echo "Removing SSU rRNA for: $(basename "$file1") and $(basename "$file2")"
    run_bbduk "$temp1" "$temp2" "$final1" "$final2" "$SSU_DB"
    
    # Optional: Remove intermediate files
    rm "$temp1" "$temp2"
    
    echo "SSU and LSU rRNA removal done for: $(basename "$file1") and $(basename "$file2")"
done

echo "rRNA removal process completed."

in=“$in1”: Specifies the input file for the first mate in paired-end reads.

in2=“$in2”: Specifies the second mate in paired-end reads.

out=“$out1”: Specifies the output file for processed first mates.

out2=“$out2”: Specifies the output file for processed reads from the second input file in paired-end sequencing.

ref=“$ref”: This parameter specifies the reference sequences to be used for trimming or filtering the reads..

k=31: Defines the k-mer length to be used for matching against the reference sequences. This is important for the specificity and sensitivity of the matching process.

ktrim=n: Indicates that k-mers found in reads should not be trimmed off. Instead, reads will be left intact even if matching k-mers are found.

mcf=0.5: Sets the minimum coverage filter for trimming. It specifies the minimum fraction of bases that must match in a k-mer for it to be considered a valid match.

tbo: Stands for “trim by overlap”. This option enables trimming of adapters based on paired-end read overlap detection, which can improve adapter removal when adapter sequences are not known or specified.

tpe: Enables “trim paired ends”, which means that if adapters were trimmed from either read in a pair, the other read is trimmed by the same amount.

Now we can run multiqc again.

processed_multiqc.sh:

#!/bin/bash

# Directory containing FASTQ files
FASTQ_DIR="rRNA_removed_fastq/"

# Output directory for FastQC results
OUTPUT_DIR="processed_fastqc_results"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR" multiqc

# Run FastQC on all FASTQ files in the directory
fastqc -o "$OUTPUT_DIR" -t 40 "$FASTQ_DIR"/*.fastq

# Run MultiQC to aggregate the FastQC reports
multiqc "$OUTPUT_DIR" -o processed_multiqc

5. Alignment with star

First, we should download hg38 assembly (used in the paper).

hg38.sh:

#!/bin/bash

wget ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Create an index with index.sh:

#!/bin/bash

GENOME_FASTA="Homo_sapiens.GRCh38.dna.primary_assembl$
ANNOTATION_GTF="GSE211993_gencode_fantomcat.v1.02.gen$
STAR_INDEX_DIR="star_index"

mkdir -p "$STAR_INDEX_DIR"

# Run STAR in genomeGenerate mode
STAR --runThreadN 32 \
    --runMode genomeGenerate \
    --genomeDir "$STAR_INDEX_DIR" \
    --genomeFastaFiles "$GENOME_FASTA" \
    --sjdbGTFfile "$ANNOTATION_GTF" \  
    --sjdbOverhang 99 
    
echo "Indexing complete."

–runThreadN 32: Number of threads

–runMode genomeGenerate: This tells STAR to run in genome indexing mode.

–genomeDir “$STAR_INDEX_DIR”: Specifies the directory where the generated genome indices will be stored.

–genomeFastaFiles “$GENOME_FASTA”: Specifies the path to the FASTA files containing the reference genome sequences.

–sjdbGTFfile “$ANNOTATION_GTF”: Specifies the path to the GTF file containing the genome annotation.

–sjdbOverhang 99: This parameter sets the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. The value should be set to the read length minus 1. In this case, 99 suggests that the read length is expected to be 100 bases.

run STAR with star.sh:

#!/bin/bash

# Define the number of threads to use
THREADS=32

# Path to the STAR genome index
GENOME_DIR="star_index"

# Directory containing rRNA removed FASTQ files
FASTQ_DIR="rRNA_removed_fastq"

# Output directory for the alignments
ALIGNMENT_DIR="alignments"
mkdir -p "$ALIGNMENT_DIR"

# Loop through the FASTQ files in pairs
for R1 in "$FASTQ_DIR"/*_1_trimmed_rRNA_removed.fastq; do
    # Infer the name of the R2 file from the name of the R1 file
    R2="${R1%_1_trimmed_rRNA_removed.fastq}_2_trimmed_rRNA_removed.fastq"

    # Extract the base name for output naming
    SAMPLE_NAME=$(basename "$R1" _1_trimmed_rRNA_removed.fastq)

    echo "Processing $SAMPLE_NAME"

    # Run STAR for alignment
    STAR --runMode alignReads \
         --runThreadN $THREADS \
         --genomeDir "$GENOME_DIR" \
         --readFilesIn "$R1" "$R2" \
         --outFileNamePrefix "${ALIGNMENT_DIR}/${SAMPLE_NAME}." \
         --outSAMtype BAM SortedByCoordinate

    echo "Alignment completed for $SAMPLE_NAME"
done

echo "All sample alignments have been completed."

–runMode alignReads: This specifies the operation mode of STAR. alignReads tells STAR to perform the read alignment against the prepared genome index.

–runThreadN Number of threads

–genomeDir “$GENOME_DIR”: This sets the path to the directory containing the genome index files that were previously generated with STAR in genomeGenerate mode.

–readFilesIn “\(R1" "\)R2”: Specifies the input files containing the RNA-seq reads.

–outFileNamePrefix “\({ALIGNMENT_DIR}/\){SAMPLE_NAME}.”: This parameter sets the prefix for all output files generated by STAR.

–outSAMtype BAM SortedByCoordinate: This tells STAR to output the alignment results in BAM format, sorted by genomic coordinates.

Checking mapping with MultiQC. mapping_quality.sh:

#!/bin/bash

# Define the directory containing STAR output files
STAR_OUTPUT_DIR="alignments"

# Define the directory where you want to save the MultiQC report
REPORT_DIR="$(pwd)/star_multiqc_report"

# Run MultiQC on the STAR outputs
echo "Running MultiQC on the STAR outputs..."
multiqc $STAR_OUTPUT_DIR -o $REPORT_DIR

Output:

6. FeatureCounts

featurecounts.sh:

#!/bin/bash

# Path to the directory containing BAM files
BAM_DIR="alignments"

# Path to the GTF annotation file
ANNOTATION_FILE="GSE211993_gencode_fantomcat.v1.02.genes_transcripts_exons.gtf"

# Output directory for featureCounts results
OUTPUT_DIR="featureCounts_results"
mkdir -p "$OUTPUT_DIR"

# Output file name
OUTPUT_FILE="${OUTPUT_DIR}/featureCounts_results.txt"

# Run featureCounts
featureCounts -a "$ANNOTATION_FILE" \
              -o "$OUTPUT_FILE" \
              -F GTF \
              -t exon \
              -g gene_id \
              -O \
              --minOverlap 90 \
              --fracOverlap 0.9 \
              --primary \
              -p \
              -B \
              -C \
              -T 32 \
              "$BAM_DIR"/*.bam

echo "featureCounts has completed."

-a “$ANNOTATION_FILE”: Specifies the annotation file.

-o “$OUTPUT_FILE”: Designates the output file that will contain the counts of reads for each feature.

-F GTF: Indicates the format of the annotation file. In this case, it’s specified as GTF.

-t exon: Specifies the type of feature to count reads against. Here, exon means that reads will be counted if they map to exon regions defined in the annotation file.

-g gene_id: This option sets the attribute in the annotation file that will be used to group features (e.g., exons) into meta-features (e.g., genes) for counting.

-O: Allows reads to be assigned to more than one feature if they overlap multiple features. This is important for reads that might map to regions where features overlap (e.g., alternative splicing sites).

–minOverlap 90: Sets the minimum number of overlapping bases required for a read to be considered as overlapping with a feature. In this case, a read must overlap a feature by at least 90 bases to be counted.

–fracOverlap 0.9: Specifies the minimum fraction of overlapping bases in a read that must overlap with a feature for the read to be considered.

–primary: Only counts primary alignments of reads. This is used to exclude secondary or supplementary alignments, which is important for ensuring that each read is counted only once, especially in the presence of multimapping reads.

-p: Indicates that the input data are paired-end reads. This option tells featureCounts to count fragments (pair of reads) instead of individual reads, which is the correct approach for paired-end RNA-seq data.

-B: Requires both ends of a pair to be successfully aligned for a fragment to be considered. This helps ensure that only high-quality, properly aligned read pairs are counted.

-C: Excludes reads that have their two ends mapping to different chromosomes or mapping to the same chromosome but on opposite strands, to filter out chimeric or improperly paired reads.

-T 32: Specifies the number of threads to use for processing.

“$BAM_DIR”/*.bam: Specifies the input BAM files containing the aligned reads.

7. Gene count summary

Summary plot is interactive and can be zoomed in/out

# Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Read the gene count summary and metadata files using base R functions
gene_count_summary <- read.table("count_table.txt.summary", header = TRUE, stringsAsFactors = FALSE)
metadata <- read.csv("metadata.txt", stringsAsFactors = FALSE)

# Efficient mapping and column name update with vectorized operations
metadata$genotype <- gsub("- gene corrected", "GC", metadata$genotype)
srr_to_genotype <- setNames(metadata$genotype, metadata$Run)

# Update column names in gene_count_summary based on SRR numbers, efficiently
colnames(gene_count_summary)[-1] <- sapply(colnames(gene_count_summary)[-1], function(name) {
  srr_number <- regmatches(name, regexpr("SRR[0-9]+", name))
  if(length(srr_number) > 0 && !is.na(srr_to_genotype[srr_number])) srr_to_genotype[srr_number] else name
})

# Reorder the dataframe columns based on sorted names directly
gene_count_summary <- gene_count_summary[, c("Status", sort(names(gene_count_summary)[-1]))]

# Filter rows before melting to reduce the amount of data processed
gene_count_summary_filtered <- gene_count_summary %>% 
  filter(Status %in% c("Assigned", "Unassigned_MultiMapping", "Unassigned_NoFeatures", "Unassigned_Overlapping_Length"))

# Melt the data
gene_count_long <- melt(gene_count_summary_filtered, id.vars = "Status", variable.name = "Sample", value.name = "Count")

# Create a unique identifier for each replicate
gene_count_long <- gene_count_long %>%
  group_by(Sample) %>%
  mutate(Replicate = paste(Sample, row_number(), sep = "_")) %>%
  ungroup()

# Plotting
p <- ggplot(gene_count_long, aes(y = Sample, x = Count, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge", orientation = "y") +
  theme_minimal() +
  labs(title = "Counts for Each Replicate and Status", y = "Replicate", x = "Count") +
  scale_fill_brewer(palette = "Set3") # Adjust palette as needed

# Convert ggplot object to an interactive plotly object
ggplotly(p)

Now lets look at the count table

# Read the gene count summary and metadata files using base R functions
count_table <- read.table("count_table.txt", header = TRUE, stringsAsFactors = FALSE)

# Efficient mapping and column name update with vectorized operations
metadata$genotype <- gsub("- gene corrected", "GC", metadata$genotype)
srr_to_genotype <- setNames(metadata$genotype, metadata$Run)

# Update column names based on SRR numbers, efficiently
colnames(count_table)[7:ncol(count_table)] <- sapply(colnames(count_table)[7:ncol(count_table)], function(name) {
  srr_number <- regmatches(name, regexpr("SRR[0-9]+", name))
  if(length(srr_number) > 0 && !is.na(srr_to_genotype[srr_number])) {
    srr_to_genotype[srr_number]
  } else {
    name
  }
})


# Extract the gene identifiers (or names) to use as row names
gene_ids <- count_table[, 1]

# Extract the count data, excluding the first column and non-count columns
count_data <- count_table[, 7:ncol(count_table)] # Adjust indices as necessary

# Convert to matrix if it's not already one
count_data_matrix <- as.matrix(count_data)

# Assign gene identifiers as row names to the matrix
rownames(count_data_matrix) = gene_ids

print(count_data_matrix[1:4, ])
##                   L150P GC L150P GC.1 L150P GC.2 L150P GC.3 L150P GC.4
## CATG00000000004.1        0          0          0          0          1
## CATG00000000008.1        0          2          0          2          1
## CATG00000000011.1        4         12          9          4         10
## CATG00000000025.1        0          0          0          0          0
##                   L150P fAD L150P fAD.1 L150P fAD.2 L150P fAD.3 L150P fAD.4
## CATG00000000004.1         0           0           0           0           0
## CATG00000000008.1         0           0           0           0           0
## CATG00000000011.1        10          14           8          10           8
## CATG00000000025.1         1           0           0           0           0
##                   A79V GC A79V GC.1 A79V GC.2 A79V GC.3 A79V GC.4 A79V fAD
## CATG00000000004.1       2         2         2         0         0        0
## CATG00000000008.1       0         0         0         0         0        0
## CATG00000000011.1      19        20         4         6         7       12
## CATG00000000025.1       0         0         4         0         2        0
##                   A79V fAD.1 A79V fAD.2 A79V fAD.3 A79V fAD.4 L150P GC.5
## CATG00000000004.1          0          2          2          2          0
## CATG00000000008.1          0          0          4          0          0
## CATG00000000011.1         20          4         11         11          0
## CATG00000000025.1          0          2          0          0          0
##                   L150P GC.6 L150P GC.7 L150P GC.8 L150P GC.9 L150P fAD.5
## CATG00000000004.1          0          1          0          0           0
## CATG00000000008.1          0          0          0          0           0
## CATG00000000011.1          0          0          0          0           0
## CATG00000000025.1          0          0          0          0           0
##                   L150P fAD.6 L150P fAD.7 L150P fAD.8 L150P fAD.9 A79V GC.5
## CATG00000000004.1           0           0           0           0         1
## CATG00000000008.1           0           0           0           0         2
## CATG00000000011.1           0           0           0           0         0
## CATG00000000025.1           0           0           0           0         4
##                   A79V GC.6 A79V GC.7 A79V GC.8 A79V GC.9 A79V fAD.5 A79V fAD.6
## CATG00000000004.1         0         0         2         0          0          0
## CATG00000000008.1         0         0         2         0          2          0
## CATG00000000011.1         2         0         2         0          0          0
## CATG00000000025.1         0         5         0         2          0          2
##                   A79V fAD.7 A79V fAD.8 A79V fAD.9
## CATG00000000004.1          0          2          4
## CATG00000000008.1          0          0          0
## CATG00000000011.1          0          0          0
## CATG00000000025.1          0          0          0